4 Jensen's Inequality, Entropy

1 Convex Function

Convex Function

A function $g:(a,b)\to\mathbb{R}$ is convex if
$$g(\lambda x_1+(1-\lambda)x_2)\le \lambda g(x_1)+(1-\lambda)g(x_2),\tag{2.1}$$
for all $x_1,x_2\in(a,b)$ and $\lambda\in[0,1]$.
$g$ is called strictly convex if equality holds only for $\lambda=0,1$, or $x_1=x_2$. I.e., the graph of $g$ over $(a,b)$ contains no straight line segments.
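As a quick numeric sanity check of (2.1), the sketch below evaluates the chord gap $\lambda g(x_1)+(1-\lambda)g(x_2)-g(\lambda x_1+(1-\lambda)x_2)$ for the strictly convex $g(x)=x^2$ over a grid; the function `chord_gap` and all grid values are illustrative choices, not from the notes.

```python
import numpy as np

def chord_gap(g, x1, x2, lam):
    """lam*g(x1) + (1-lam)*g(x2) - g(lam*x1 + (1-lam)*x2); >= 0 iff the chord lies above the graph."""
    return lam * g(x1) + (1 - lam) * g(x2) - g(lam * x1 + (1 - lam) * x2)

g = lambda x: x ** 2  # strictly convex
gaps = [chord_gap(g, x1, x2, lam)
        for x1 in np.linspace(-2, 2, 9)
        for x2 in np.linspace(-2, 2, 9)
        for lam in np.linspace(0, 1, 11)]
print(min(gaps) >= -1e-12)  # True: (2.1) holds at every grid point
```

For $g(x)=x^2$ the gap is exactly $\lambda(1-\lambda)(x_1-x_2)^2\ge 0$, which is why only floating-point tolerance is needed.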

![[Pasted image 20241128122332.png|500]]

Now consider three points $P_1,P_2,P_3\in\mathbb{R}^2$. The convex hull of $\{P_1,P_2,P_3\}$ is
$$\{(x,y)\in\mathbb{R}^2 \mid (x,y)=\lambda_1P_1+\lambda_2P_2+\lambda_3P_3,\ \lambda_1+\lambda_2+\lambda_3=1,\ \lambda_i\ge 0\},$$
the triangular region spanned by the three points. If $g$ is convex, by the figure
$$g(\lambda_1x_1+\lambda_2x_2+\lambda_3x_3)\le \lambda_1g(x_1)+\lambda_2g(x_2)+\lambda_3g(x_3).$$
![[Pasted image 20241128123143.png|400]]
Iterating this argument, we can conclude the analogous inequality for any finite number of points:

2 Jensen's Inequality

Theorem (Jensen's Inequality)

A convex function $g:(a,b)\to\mathbb{R}$ satisfies
$$g\left(\sum_{i=1}^n \lambda_i x_i\right)\le \sum_{i=1}^n \lambda_i g(x_i),\tag{2.2}$$
for all $\lambda_1,\dots,\lambda_n$ satisfying $\sum_{i=1}^n\lambda_i=1$ and $\lambda_i\ge 0\ \forall i$.

Corollary

Let $X$ be an $\mathbb{R}$-valued discrete RV and $g$ a convex function. Then
$$g(E[X])\le E[g(X)].\tag{2.3}$$
If $g$ is strictly convex and $X$ is not constant, then $g(E[X])<E[g(X)]$.
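A minimal numeric illustration of the corollary, using the strictly convex $g(x)=e^x$ and a small discrete RV whose values and probabilities are arbitrary choices:

```python
import numpy as np

x = np.array([-1.0, 0.0, 2.0])   # values of X
p = np.array([0.2, 0.5, 0.3])    # P(X = x_i)
g = np.exp                       # strictly convex

lhs = g(np.dot(p, x))            # g(E[X])
rhs = np.dot(p, g(x))            # E[g(X)]
print(lhs < rhs)                 # True: strict, since X is not constant
```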

3 Entropy

Suppose $P,Q$ are two probability measures on $(\Omega,\mathcal{F})$, and $X$ a discrete RV, s.t.
$$P(X=x)=p(x),\quad Q(X=x)=q(x),\quad x\in\mathrm{Range}(X).$$

Shannon Entropy

$$H(P)=-\sum_x p(x)\log p(x)=E_p[-\log p(X)].$$
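A small sketch of the definition (natural log here; use `np.log2` for bits), with the usual convention $0\log 0=0$ handled by masking zero-probability outcomes:

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p(x) log p(x), with 0*log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([0.5, 0.5]))   # ln 2 ≈ 0.6931: maximal for two outcomes
print(entropy([1.0, 0.0]))   # 0.0: a deterministic RV carries no surprise
```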

Cross Entropy

The cross entropy of $Q$ relative to $P$ is
$$H(P,Q)=-\sum_x p(x)\log q(x)=E_p[-\log q(X)].$$

KL Divergence

The KL divergence (Kullback–Leibler divergence) of $P$ from $Q$ is
$$\mathrm{KL}(P\|Q)=H(P,Q)-H(P)=-\sum_x p(x)\log\frac{q(x)}{p(x)}=E_p\left[-\log\frac{q(X)}{p(X)}\right].$$
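The identity $\mathrm{KL}(P\|Q)=H(P,Q)-H(P)$ can be checked numerically; the two distributions below are arbitrary illustrative choices over three outcomes:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

H_p  = -np.sum(p * np.log(p))       # Shannon entropy H(P)
H_pq = -np.sum(p * np.log(q))       # cross entropy H(P, Q)
kl   =  np.sum(p * np.log(p / q))   # KL(P||Q)

print(np.isclose(kl, H_pq - H_p))   # True: KL = cross entropy minus entropy
print(kl >= 0)                      # True: non-negativity, proved below via Jensen
```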

For a continuous RV, replace $p(x)$ with the p.d.f. and the sum with an integral, i.e. $\mathrm{KL}(p\|q)=\int p(x)\log\frac{p(x)}{q(x)}\,dx$.
In general, $\mathrm{KL}(p\|q)\ne \mathrm{KL}(q\|p)$.
We can use Jensen's inequality to prove that the KL divergence is non-negative. See below.

To remedy this asymmetry, we define the JS divergence:

JS Divergence

Denote $M=\frac{1}{2}(P+Q)$; then the JS divergence is defined as
$$\mathrm{JS}(P\|Q)=\frac{1}{2}\mathrm{KL}(P\|M)+\frac{1}{2}\mathrm{KL}(Q\|M).$$
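A sketch contrasting the two divergences on the same pair of (arbitrarily chosen) distributions: KL is asymmetric, while JS is symmetric by construction.

```python
import numpy as np

def kl(p, q):
    """KL(p||q) for strictly positive discrete distributions."""
    return np.sum(p * np.log(p / q))

def js(p, q):
    """JS(p||q) = 0.5*KL(p||m) + 0.5*KL(q||m) with m the midpoint mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

print(np.isclose(kl(p, q), kl(q, p)))   # False: KL is asymmetric
print(np.isclose(js(p, q), js(q, p)))   # True: JS is symmetric
```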


See more discussion of entropy in Maximum Entropy Model, and of information gain in Decision Tree.

4 ELBO

$X$: observed data. $Z$: latent variable. $q$: a distribution over $Z$. Then

$$\mathrm{KL}(q\|p)=E_q\left[-\log\frac{p(Z\mid X,\theta)}{q(Z)}\right],$$

then
$$\underbrace{\log p(X\mid\theta)}_{\text{log-likelihood}}=\underbrace{\mathcal{L}(q,X,\theta)}_{\text{ELBO}}+\underbrace{\mathrm{KL}(q\|p)}_{\text{KL divergence}},$$
where
$$\mathcal{L}(q,X,\theta)=E_q\left[\log\frac{p(X,Z\mid\theta)}{q(Z)}\right]$$
is called the Evidence Lower Bound (ELBO).
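The decomposition $\log p(X\mid\theta)=\mathcal{L}(q,X,\theta)+\mathrm{KL}(q\|p)$ can be verified on a toy two-state latent model; the prior, likelihood, and variational distribution below are all arbitrary assumed numbers.

```python
import numpy as np

p_z         = np.array([0.6, 0.4])   # prior p(Z)
p_x_given_z = np.array([0.1, 0.7])   # likelihood p(X = x_obs | Z), one fixed observation
q           = np.array([0.3, 0.7])   # arbitrary variational distribution over Z

p_xz = p_z * p_x_given_z             # joint p(X = x_obs, Z)
log_evidence = np.log(p_xz.sum())    # log p(X = x_obs)
p_post = p_xz / p_xz.sum()           # posterior p(Z | X = x_obs)

elbo = np.sum(q * np.log(p_xz / q))        # E_q[log p(X,Z)/q(Z)]
kl   = np.sum(q * np.log(q / p_post))      # KL(q || p(.|X))

print(np.isclose(log_evidence, elbo + kl)) # True: the decomposition holds
print(elbo <= log_evidence)                # True: ELBO is a lower bound
```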

Claim

$\mathrm{KL}(p\|q)\ge 0$, and "$=$" holds iff $p(x)=q(x)\ \forall x$.
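This claim is exactly where Jensen's inequality enters. A short derivation, assuming $p$ and $q$ have the same support:

```latex
\mathrm{KL}(p\|q)
  = E_p\!\left[-\log\frac{q(X)}{p(X)}\right]
  \;\ge\; -\log E_p\!\left[\frac{q(X)}{p(X)}\right]
  = -\log \sum_x p(x)\,\frac{q(x)}{p(x)}
  = -\log \sum_x q(x)
  = -\log 1
  = 0,
```

where the inequality is (2.3) applied to the convex function $-\log$. Since $-\log$ is strictly convex, equality forces $q(X)/p(X)$ to be constant, and both distributions summing to 1 then gives $p=q$.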

Key facts for ELBO:

  1. $\mathcal{L}(q,X,\theta)$ is a lower bound on the log-likelihood $\log p(X\mid\theta)$, since by the above $\mathrm{KL}(q\|p)\ge 0$ for all $p,q$.
  2. The ELBO is typically much easier to compute or optimize than $\log p(X\mid\theta)$.